Abstract

We describe Information Forests, an approach to classification that
generalizes Random Forests by changing the splitting criterion of
non-leaf nodes from a discriminative one, based on the entropy of the label
distribution, to a generative one, based on maximizing the information
divergence between the class-conditional distributions in the resulting
partitions. The basic idea is to defer classification until a measure of
"classification confidence" is sufficiently high, and in the meantime to
break down the data so as to maximize this measure. In an alternative
interpretation, Information Forests attempt to partition the data into
subsets that are "as informative as possible" for the purpose of the task,
which is to classify the data. Classification confidence, or informative
content of the subsets, is quantified by the Information Divergence. Our
approach relates to active learning, semi-supervised learning, and mixed
generative/discriminative learning.

1 Introduction 

We introduce Information Forests (IFs), a family of part-based classifiers
designed for problems that are not easily solvable as a whole. In IFs there is
a hidden location or selection variable that is key to performing classification:
while there may be no distinguishing characteristic between the positive and
negative samples considered as a whole, one can find "informative subsets"
(regions, parts, or groups) where classification is simple to carry out. However,
IFs are not restricted to these problems, and can be interpreted as a generic
family of classifiers that includes Random Forests (RFs) as a special case.

The motivation comes from problems such as detection of people in images, 
where the distribution of intensity or color values in the region occupied by 
a person is not discriminative, and could be identical to the distribution of 
intensity or color values outside the same region. However, when the problem is 
restricted to smaller regions, or "parts," the problem may be more easily solved. 



1.1 Intuition 



The key idea of Information Forests is to defer attempts to classify data points,
and focus first on grouping them in a way that makes classification as simple as
possible. In other words, the goal at the outset is not to partition the data into
clusters that are as "pure" as possible (belonging to the same class). Instead,
the goal is to break down the original classification problem (for the entire
dataset) into smaller subsets that are as simple as possible to classify down
the line. Only when the classification problem is "simple enough" is it actually
carried out. Otherwise, the grouping process proceeds in a recursive,
hierarchical fashion. In this divide-et-impera scheme, the goal is to determine
groups of data that are as informative as possible for the purpose of the task,
which is the determination of the class label $\lambda$. Such groups can be
considered "regions" or "parts" or "subsets" depending on the application.
This is illustrated in Fig. 1.




Figure 1: Random Forest vs. Information Forest. A sequence of n groups
alternating positive/negative/positive/negative etc., partitioned using a Random
Forest with linear stumps, requires a number of levels that grows linearly with n
(left). An Information Forest using the same stumps (right) does not try to
classify samples immediately but instead tries to partition them into groups that
are simple to classify, and defers the decision until confidence $\tau$ is
sufficiently high and information gain $\delta$ sufficiently small.



1.2 Formalization 

Let $\lambda \in \{0, 1\}$ be a binary class label, $x \in D \subset \mathbb{R}^k$,
with $k = 2, 3$, a location variable, and $y : D \to Y$, $x \mapsto y(x)$ a
measurement (or "feature") associated to location $x$, that takes values in some
vector space $Y$. When the domain $D$ is discretized (e.g., the planar lattice),
$x$ can be identified with an index $i \in \Lambda \mid x_i \in D$. In that case,
we indicate $y(x_i)$ simply by $y_i$.

A (binary¹) segmentation problem consists of partitioning the spatial domain
$D$ into two regions, $\Omega$ and $D \setminus \Omega$, according to the value
of the feature $y(x)$. This can be done by considering the posterior probability

$$P(\lambda \mid y) \propto p(y \mid \lambda) P(\lambda), \tag{1}$$

where the first term on the right hand side indicates the likelihood, and the
second term the location prior. It should be clear that meaningfully solving
this problem hinges on the two likelihoods, $p(y \mid \lambda = 1)$ and
$p(y \mid \lambda = 0)$, being different:

$$p(y \mid \lambda = 1) \neq p(y \mid \lambda = 0). \tag{2}$$

¹ Extension to multi-class segmentation, where $\lambda \in \{1, 2, \ldots, M\}$,
is straightforward and will therefore not be considered here.

If this is the case, we can infer $\lambda$ and, from it,
$\Omega = \{x \mid \lambda(x) = 1\}$. However, there are plenty of examples
where (2) is violated. We refer to problems where the condition (2) is violated
as problems that "are not solvable as a whole", in the sense that we cannot
segment the spatial domain simply by comparing statistics inside $\Omega$ to
statistics outside. Nevertheless, it may be possible to determine parts, or
local regions $S_j \subset D$, within which the likelihoods are different:

$$\exists\, \{S_j\}_{j=1}^{N} \;\mid\; p(y \mid x \in S_j, \lambda = 1) \neq p(y \mid x \in S_j, \lambda = 0), \quad S_j \subset D, \; j = 1, \ldots, N. \tag{3}$$

Note that the collection $\{S_j\}$ is not unique and does not need to form a
partition of $D$: there is no requirement that $S_i \cap S_j = \emptyset$ for
$i \neq j$, so long as the union of these regions covers² $D$. The regions
$S_j$ do not even need to be simply connected. In some applications, one may
want to impose these further conditions.
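As a purely illustrative sketch (the toy data, the two-part layout, and the helper names below are assumptions, not taken from the paper), the following Python snippet builds a dataset in which condition (2) fails on the whole domain while condition (3) holds within each part:

```python
from collections import Counter

def empirical_dist(values):
    """Return the empirical distribution of a list of discrete values."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

# Hypothetical toy data: each sample is (part index j, feature y, label lam).
# In part 0 positives have y = 1 and negatives y = 0; in part 1 it is reversed.
samples = [
    (0, 1, 1), (0, 1, 1), (0, 0, 0), (0, 0, 0),
    (1, 0, 1), (1, 0, 1), (1, 1, 0), (1, 1, 0),
]

def likelihood(samples, lam, part=None):
    """Empirical p(y | lambda = lam), optionally restricted to one part S_j."""
    ys = [y for (j, y, l) in samples if l == lam and (part is None or j == part)]
    return empirical_dist(ys)

# On the whole domain the two likelihoods coincide, so (2) is violated.
whole_pos = likelihood(samples, 1)
whole_neg = likelihood(samples, 0)

# Restricted to part 0 they differ, so (3) is satisfied there.
part0_pos = likelihood(samples, 1, part=0)
part0_neg = likelihood(samples, 0, part=0)
```

Here the class-conditional distributions of y are both uniform over {0, 1} on the whole domain, yet they are concentrated on different values inside each part.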

In the discrete-domain case, we identify the index $i$ with the location $x_i$,
so the regions become subsets of the data. With an abuse of notation, we write

$$S_j = \{i_1, i_2, \ldots, i_{n_j}\}. \tag{4}$$

Therefore, we write the two conditions (2) and (3) as

$$p(y_i \mid \lambda_i = 1) \neq p(y_i \mid \lambda_i = 0),$$

$$p(y_i \mid i \in S_j, \lambda_i = 1) \neq p(y_i \mid i \in S_j, \lambda_i = 0). \tag{5}$$



Assuming these conditions are satisfied, we can write the posteriors by
marginalizing over the sets $S_j$,

$$p(\lambda \mid y_i) \propto \sum_j p(y_i \mid i \in S_j, \lambda)\, P(i \in S_j \mid \lambda)\, P(\lambda) \tag{6}$$

or by maximizing over all possible collections of sets {Sj}. In either case, the 
sets Sj are not known, so the segmentation problem is naturally broken down 
into two components: One is to determine the sets Sj, the other is to determine 
the class labels within each of them: 



Given a training set of labeled samples $\{y_i, \lambda_i\}_{i=1}^{N}$,

Find a collection of sets $\{S_j\}_{j=1}^{N}$ such that $S_j \subset D$ and
$D \subset \cup_j S_j$, that are "as informative as possible" for the purpose
of determining the class label $\lambda$.

If the sets are "sufficiently informative" of $\Omega$, perform the
classification; that is, determine the label $\lambda$ within these sets.

² Indeed, even this condition can be relaxed to assuming that these regions
cover the boundary of $\Omega$, $\cup_j S_j \supset \partial\Omega$, by making
suitable assumptions on the prior $p(\lambda \mid x)$.

The key condition translates to the restricted likelihoods
$p(y_i \mid i \in S_j, \lambda = 1)$ and $p(y_i \mid i \in S_j, \lambda = 0)$
being "as different as possible" in the sense of relative entropy (information
divergence, or Kullback-Leibler divergence). When they are sufficiently
different, the set is sufficiently informative of $\Omega$, and classification
can be easily performed by comparing likelihood or posterior ratios.
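The last step, classification by comparing posteriors within a set, can be sketched as follows. This is a minimal illustration assuming discrete features, smoothed empirical likelihoods, and a user-supplied prior; the function names `fit_likelihoods` and `classify` are hypothetical.

```python
from collections import Counter

def fit_likelihoods(ys, labels, support, eps=1e-9):
    """Estimate smoothed empirical likelihoods p(y | lambda) for lambda in {0, 1}."""
    model = {}
    for lam in (0, 1):
        vals = [y for y, l in zip(ys, labels) if l == lam]
        counts = Counter(vals)
        total = len(vals) + eps * len(support)
        model[lam] = {v: (counts.get(v, 0) + eps) / total for v in support}
    return model

def classify(y, model, prior1=0.5):
    """Pick the label with the larger unnormalized posterior p(y | lambda) P(lambda)."""
    post1 = model[1][y] * prior1
    post0 = model[0][y] * (1.0 - prior1)
    return 1 if post1 > post0 else 0

# Hypothetical samples from a single informative set S_j.
model = fit_likelihoods(ys=[1, 1, 0, 0], labels=[1, 1, 0, 0], support=[0, 1])
```

When the two restricted likelihoods are far apart, as in this toy set, the comparison is decided by a wide margin, which is the sense in which classification becomes "easy".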

This problem relates to active learning, in the sense that the classifier has
to select, among all possible subsets, the ones that are informative in the sense
of enabling the classification of $\lambda$. A possible approach would be to
select $S_j$ at random. However, an active learner would want to choose, among
all possible $S_j$, the ones that are most informative towards solving the
original classification problem, that is, to determine $\lambda$. It also
relates to semi-supervised learning with model selection, since, in addition to
determining the discrete variable $\lambda$ for which supervision is provided
via the training set, one has to determine the sets $S_j$, which can be
interpreted as groupings, or collections, or subsets of the training data.
However, no supervision is given as to which point belongs to which group
$S_j$. In addition, the number of such regions $N$ is not known and has to be
inferred (model selection). This problem also touches on the issue of
generative/discriminative models, since the groups $S_j$ can be interpreted as
generative (latent mixture model), while the ultimate goal is classification.

Information Forests implement the program above using the machinery of 
boosting and decision trees, as we describe next. 

2 Derivation of Information Forests 

Information Forests are a family of classifiers that accomplish the goals described 
in the previous sections using the tools of randomized trees. 

The groups ("clusters", or "regions") $S_j \subset D$ are chosen within a class
$\mathcal{S}$ defined by a family of simple classifiers (decision stumps). For
convenience, we expand the index $j$ into two indices, one relating to the
"features" $f_j$ and one relating to a threshold $\theta_k$. We then define, for
a continuous location parameter $x$,

$$S_{jk} = \{x \in D \mid f_j(x, y) > \theta_k\} \tag{7}$$

where the feature $f : D \times Y \to \mathbb{R}$; $(x, y) \mapsto f(x, y)$ is
any scalar-valued statistic and the threshold $\theta \in \mathbb{R}$ is chosen
within a finite set. We call the set of features $\mathcal{F} = \{f_j\}$ and the
set of thresholds $\Theta = \{\theta_k\}$. The complement of $S_{jk}$ in $D$ is
indicated with $S_{jk}^c = \{x \in D \mid f_j(x, y) \leq \theta_k\} = D \setminus S_{jk}$.
In the simplest case, for a grayscale image, we could have $f(x, y) = y(x)$,
where $y(x)$ is the intensity value at pixel $x$. More generally, $f$ can be any
(scalar) function of $y$ in a neighborhood of $x$. For the discrete case, where
$i$ is identified with the location $x_i$, with an abuse of notation we write

$$S_{jk} = \{i \in \Lambda \mid f_j(y_i) > \theta_k\} \tag{8}$$

and again $S_{jk}^c = \{i \in \Lambda \mid f_j(y_i) \leq \theta_k\}$. Here the
features are $f : \Lambda \times Y \to \mathbb{R}$; $(i, y) \mapsto f(y_i)$.
Specifying the feature and threshold $(f_j, \theta_k)$ is equivalent to
specifying the set $S_{jk}$ and its complement $S_{jk}^c$.

We are interested in building informative sets using recursive binary
partitions, so at each stage we only select one pair $\{S_{jk}, S_{jk}^c\}$.
Among all features in $\mathcal{F}$ and thresholds in $\Theta$, Information
Forests choose the one that makes the set $S_{jk}$ "as informative as possible"
for the purpose of classification. From (5) it can be seen that the quantity
that measures the "information content" of a set $S_{jk}$ (or a feature
$f_j, \theta_k$) for the purpose of classification is the Information
Divergence (Relative Entropy, or Kullback-Leibler Divergence) between the
distributions $p(y_i \mid i \in S_{jk}, \lambda_i = 1)$ and
$p(y_i \mid i \in S_{jk}, \lambda_i = 0)$. In short-hand, we write
$p(y_i \mid \cdots, \lambda_i = 1)$ as $p_1(y_i \mid \cdots)$ and
$p(y_i \mid \cdots, \lambda_i = 0)$ as $p_0(y_i \mid \cdots)$, and

$$\mathrm{KL}(f_j, \theta_k) = \frac{|S|}{|D|} \mathrm{KL}\big(p_1(y_i \mid i \in S) \,\|\, p_0(y_i \mid i \in S)\big) + \frac{|S^c|}{|D|} \mathrm{KL}\big(p_1(y_i \mid i \in S^c) \,\|\, p_0(y_i \mid i \in S^c)\big). \tag{9}$$

From the characterization of the sets $S_{jk}$, $i \in S_{jk}$ is equivalent to
$f_j(y_i) > \theta_k$, so we write $S_{jk} = S(f_j, \theta_k)$. Therefore, a
decision stump ("KL-node") chooses among features and thresholds one (of the
possibly many) that solves

$$f_j, \theta_k = \arg\max_{f_j \in \mathcal{F},\, \theta_k \in \Theta} \; \frac{|S(f_j, \theta_k)|}{|D|} \mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) + \frac{|S^c(f_j, \theta_k)|}{|D|} \mathrm{KL}\big(p_1(y_i \mid f_j \leq \theta_k) \,\|\, p_0(y_i \mid f_j \leq \theta_k)\big). \tag{10}$$

Here $\mathrm{KL}(p \,\|\, q) = E_p \ln \frac{p}{q} = \int \ln \frac{p}{q} \, dP$
denotes the Kullback-Leibler divergence.³ The normalization factors
$|S|/|D|$ and $|S^c|/|D|$ count the cardinality of the set $S$ and its
complement relative to the size of the domain $D$.
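The weighted divergence of Eq. (9) can be sketched for discrete features as below. This is an illustrative approximation only: the smoothing constant `eps`, the convention that a side containing a single class contributes zero, and the toy data are assumptions, not part of the paper.

```python
import math
from collections import Counter

def kl_divergence(p_samples, q_samples, support, eps=1e-9):
    """KL(p || q) between smoothed empirical distributions on a finite support."""
    def dist(samples):
        counts = Counter(samples)
        total = len(samples) + eps * len(support)
        return {v: (counts.get(v, 0) + eps) / total for v in support}
    p, q = dist(p_samples), dist(q_samples)
    return sum(p[v] * math.log(p[v] / q[v]) for v in support)

def stump_kl_score(features, ys, labels, theta, support):
    """Size-weighted divergence of the stump (feature > theta), following Eq. (9)."""
    n = len(ys)
    score = 0.0
    for in_set in (True, False):  # the set S and its complement S^c
        idx = [i for i in range(n) if (features[i] > theta) == in_set]
        pos = [ys[i] for i in idx if labels[i] == 1]
        neg = [ys[i] for i in idx if labels[i] == 0]
        if pos and neg:  # assumption: a single-class side contributes zero
            score += len(idx) / n * kl_divergence(pos, neg, support)
    return score

# Hypothetical toy data: two groups along a 1-D location axis.
location = [0, 1, 2, 3, 4, 5, 6, 7]
ys       = [1, 1, 0, 0, 0, 0, 1, 1]
labels   = [1, 1, 0, 0, 1, 1, 0, 0]

split_on_location = stump_kl_score(location, ys, labels, 3.5, support=[0, 1])
split_on_value = stump_kl_score(ys, ys, labels, 0.5, support=[0, 1])
```

In this toy case, splitting on location separates the two groups and yields a large score, while splitting on the measurement itself leaves identical class-conditional distributions on each side and scores zero.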

If the divergence value is sufficiently large, $\mathrm{KL}(f_j, \theta_k) > \tau$,
the positive and negative distributions are sufficiently different, and
therefore the classification problem is easily solvable. To actually solve it,
one could use the same decision stumps (features) $\mathcal{F}$, but now chosen
to minimize the entropy of the distribution of class labels,
$p(\lambda_i \mid i \in S_{jk}) = p(\lambda_i \mid f_j > \theta_k)$, and its
complement:

$$H(f_j, \theta_k) = \frac{|S|}{|D|} H(\lambda_i \mid f_j > \theta_k) + \frac{|S^c|}{|D|} H(\lambda_i \mid f_j \leq \theta_k), \tag{11}$$

where $H(p) = -E_p[\ln p] = -\int \ln p \, dP$ is the entropy of the
distribution $p$. If the quantity (10) is sufficiently large,
$\mathrm{KL}(f_j, \theta_k) > \tau$, (11) can be solved. If not, the process can
be iterated, and the data further split according to the same criterion, the
maximization of $\mathrm{KL}(f_j, \theta_k)$. The value $\tau$ can therefore be
interpreted as measuring the least tolerable confidence in the classification.

³ Several alternate divergence measures can be employed instead of
Kullback-Leibler's, for instance symmetrized versions of it, or more general
Jeffrey divergence.
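For comparison with the divergence criterion, the entropy criterion of Eq. (11) can be sketched as follows. The function names and the toy values are illustrative assumptions; entropies are in nats.

```python
import math

def label_entropy(labels):
    """Shannon entropy (in nats) of a list of binary labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in (0, 1):
        p = sum(1 for l in labels if l == c) / n
        if p > 0:
            h -= p * math.log(p)
    return h

def stump_entropy(features, labels, theta):
    """Size-weighted label entropy after the split (feature > theta), as in Eq. (11)."""
    n = len(labels)
    inside = [labels[i] for i in range(n) if features[i] > theta]
    outside = [labels[i] for i in range(n) if features[i] <= theta]
    return (len(inside) / n) * label_entropy(inside) \
        + (len(outside) / n) * label_entropy(outside)

# Hypothetical separable example: the stump at 0.5 yields two pure sides.
features = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
before = label_entropy(labels)
after = stump_entropy(features, labels, 0.5)
information_gain = before - after
```

The difference `before - after` is the information gain that an H-node compares against a minimum threshold before deciding whether to keep splitting.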



2.1 Implementation 

Information Forests perform hierarchical grouping (mixture modeling) and
classification by recursive binary partitioning. During training, starting from
the entire dataset $\{1, \ldots, N\}$, each node $S$ is passed through a
Divergence Test:

$$\mathrm{KL}\big(p_1(y_i \mid i \in S) \,\|\, p_0(y_i \mid i \in S)\big) > \tau. \tag{12}$$

If this condition is satisfied, the node is designated as an H-node that solves

$$f_j, \theta_k = \arg\min_{f \in \mathcal{F},\, \theta \in \Theta} H(f, \theta). \tag{13}$$



If the Information Gain is below a minimum threshold $\delta > 0$,

$$H(\lambda_i \mid i \in S) - H(f_j, \theta_k) < \delta, \tag{14}$$

the node is re-designated as a terminal node ("leaf") and the classes are
determined via

$$\lambda = \arg\max_{\lambda_i \in \{0, 1\}} p(\lambda_i \mid i \in S). \tag{15}$$



If the condition (12) is violated, the two classes are difficult to separate, so
we look to partition the data into new clusters via a KL-node that solves

$$f_j, \theta_k = \arg\max_{f \in \mathcal{F},\, \theta \in \Theta} \mathrm{KL}(f, \theta). \tag{16}$$



In either case, so long as the node is not a leaf, the selected
$f_j, \theta_k$ generates two sets, $S(f_j, \theta_k)$ and its complement, where

$$S(f_j, \theta_k) = \{i \in S \mid f_j(y_i) > \theta_k\}. \tag{17}$$

The two sets $S(f_j, \theta_k)$ and $S^c(f_j, \theta_k)$ are each fed to one of
the two children of the current node as the tree grows. As in a Random Forest,
the process is repeated multiple times, for random subsets of the data points.
During testing, each datum is run through the cascade of tests
$f_j(y_i) > \theta_k$ on multiple trees, and then voting is performed.
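The training recursion of this section can be sketched for a single tree as follows. This is a simplified illustration under several assumptions that are not in the paper: features are discrete, divergences use smoothed empirical distributions, a node containing one class becomes a leaf immediately, a child with a single class contributes zero divergence, and the random subsampling and voting across trees are omitted. All function names are hypothetical.

```python
import math
from collections import Counter

def kl(pos, neg, support, eps=1e-9):
    """Smoothed empirical KL(p1 || p0) on a finite support."""
    def dist(s):
        c = Counter(s)
        t = len(s) + eps * len(support)
        return {v: (c.get(v, 0) + eps) / t for v in support}
    p, q = dist(pos), dist(neg)
    return sum(p[v] * math.log(p[v] / q[v]) for v in support)

def node_kl(data, support):
    """Divergence at a node; 0 if one class is absent (a simplification)."""
    pos = [y for (_, y, lam) in data if lam == 1]
    neg = [y for (_, y, lam) in data if lam == 0]
    return kl(pos, neg, support) if pos and neg else 0.0

def node_entropy(data):
    """Entropy (nats) of the label distribution at a node."""
    counts = Counter(lam for (_, _, lam) in data)
    return -sum(c / len(data) * math.log(c / len(data)) for c in counts.values())

def best_stump(data, stumps, score):
    """Return (score, f, theta, gt, le) maximising the size-weighted child score."""
    best = None
    for f, theta in stumps:
        gt = [d for d in data if f(d) > theta]
        le = [d for d in data if f(d) <= theta]
        if not gt or not le:
            continue
        s = sum(len(part) / len(data) * score(part) for part in (gt, le))
        if best is None or s > best[0]:
            best = (s, f, theta, gt, le)
    return best

def grow(data, stumps, support, tau, delta):
    """Grow one tree from samples (x, y, lam) using KL-nodes and H-nodes."""
    labels = [lam for (_, _, lam) in data]
    leaf = {"kind": "leaf", "label": Counter(labels).most_common(1)[0][0]}
    if len(set(labels)) == 1:
        return leaf
    if node_kl(data, support) > tau:            # Divergence Test, Eq. (12)
        kind = "H"                              # minimise entropy, Eq. (13)
        best = best_stump(data, stumps, lambda part: -node_entropy(part))
        if best is None or node_entropy(data) + best[0] < delta:  # Eq. (14)
            return leaf                         # majority label, Eq. (15)
    else:
        kind = "KL"                             # maximise divergence, Eq. (16)
        best = best_stump(data, stumps, lambda part: node_kl(part, support))
        if best is None:
            return leaf
    _, f, theta, gt, le = best
    return {"kind": kind, "f": f, "theta": theta,
            "gt": grow(gt, stumps, support, tau, delta),
            "le": grow(le, stumps, support, tau, delta)}

def predict(tree, sample):
    """Run one sample through the cascade of tests f(sample) > theta."""
    while tree["kind"] != "leaf":
        tree = tree["gt"] if tree["f"](sample) > tree["theta"] else tree["le"]
    return tree["label"]

# Hypothetical toy data (location x, measurement y, label lam): two groups in
# which the meaning of y is reversed, so the whole set is not separable by y.
data = [(0, 1, 1), (1, 1, 1), (2, 0, 0), (3, 0, 0),
        (4, 0, 1), (5, 0, 1), (6, 1, 0), (7, 1, 0)]
stumps = [(lambda d: d[0], t + 0.5) for t in range(7)] + [(lambda d: d[1], 0.5)]
tree = grow(data, stumps, support=[0, 1], tau=1.0, delta=0.01)
```

On this toy set the root fails the Divergence Test and becomes a KL-node that groups the data by location; each child then passes the test and is resolved by a single entropy-based split.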
2.2 Approximation and lower bound

While testing consists of repeated scalar tests that have trivial computational
complexity, training requires multiple iterations of exhaustive optimization at
each node, where each step entails computing $\mathrm{KL}(f, \theta)$, that is,
a relative entropy between distributions in high-dimensional space (the feature
space $Y$). Therefore, efficient approximations are needed.

One could employ several proxies of relative entropy, including Fisher scores.
Or, one could compute relative entropy between scalar components (projections)
of feature space. We approximate the Information Divergence with a lower
bound

$$\mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) \geq \mathrm{KL}\big(p_1(\Pi(y_i) \mid f_j > \theta_k) \,\|\, p_0(\Pi(y_i) \mid f_j > \theta_k)\big) \tag{18}$$

where $\Pi(y_i)$ is any 1-D projection of $y_i$. For ease of computation, we
choose $\Pi(y_i) = f(y_i)$ from our feature pool. Since the previous inequality
holds for any $\Pi$, we have

$$\mathrm{KL}\big(p_1(y_i \mid f_j > \theta_k) \,\|\, p_0(y_i \mid f_j > \theta_k)\big) \geq \max_{f \in \mathcal{F}} \mathrm{KL}\big(p_1(f(y_i) \mid f_j > \theta_k) \,\|\, p_0(f(y_i) \mid f_j > \theta_k)\big). \tag{19}$$

This process is repeated according to the same schedule of conventional Random 
Forests. 
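The right-hand side of Eq. (19) can be estimated with 1-D histograms, as in the sketch below. The binning, the smoothing constant, and the coordinate projections are illustrative assumptions; note that a histogram estimate only approximates the divergence between the true projected distributions.

```python
import math

def hist(values, edges, eps=1e-9):
    """Smoothed normalised histogram of scalar values over fixed bin edges."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (last and v == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts) + eps * len(counts)
    return [(c + eps) / total for c in counts]

def kl_1d(pos, neg, edges):
    """KL divergence between the histograms of two scalar samples."""
    p, q = hist(pos, edges), hist(neg, edges)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_lower_bound(ys_pos, ys_neg, projections, edges):
    """Largest 1-D divergence over a pool of scalar projections, as in Eq. (19)."""
    return max(kl_1d([f(y) for y in ys_pos], [f(y) for y in ys_neg], edges)
               for f in projections)

# Hypothetical 2-D features: the classes differ only in the second coordinate.
ys_pos = [(0.1, 0.9), (0.2, 0.8)]
ys_neg = [(0.15, 0.1), (0.1, 0.2)]
projections = [lambda y: y[0], lambda y: y[1]]
edges = [0.0, 0.5, 1.0]

bound = kl_lower_bound(ys_pos, ys_neg, projections, edges)
first_coord = kl_1d([y[0] for y in ys_pos], [y[0] for y in ys_neg], edges)
second_coord = kl_1d([y[1] for y in ys_pos], [y[1] for y in ys_neg], edges)
```

Taking the maximum over the pool picks out the projection along which the two classes are most distinguishable, here the second coordinate.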

2.3 Analysis 

Information Forests are a superset of Random Forests, as the former reduce to
the latter when $\tau = 0$ is chosen. While it has been argued Q] that RFs
produce balanced trees, this is true only when the class $\mathcal{F}$ is
infinite. In practice, $\mathcal{F}$ is always finite, and typically RFs produce
heavily unbalanced trees, as the example in Fig. 1 illustrates. That example
also shows that, when the dataset is not separable by the class of decision
stumps, IFs produce more balanced and shallower trees when the set of
classifiers is restricted.

More thorough analysis of the properties of IFs and the class of problems 
they are well matched to solve is forthcoming. 

3 Discussion 

Random Forests, as a boosting variety of randomized decision trees, have been
employed with a variety of splitting criteria, mostly related to entropy of the
label distributions or mutual information between the features and the labels
[3 IH1 |2]. Breiman analyzes some of the properties of entropy and compares
it with the Gini index in [1]. However, to the best of our knowledge, all of
these approaches choose discriminative splitting criteria, where the goal is to
produce partitions that are as pure as possible at each node, and there is no
differentiation between leaf nodes and non-leaf nodes.

Several choices of decision stumps have also been applied, mostly depending 
on the application, with the simplest choices consisting of linear classifiers [3]. 
We have used simple linear scalar stumps for simplicity, but there is nothing in 
the derivation of IFs that precludes the use of more complex classifiers (other 
than computational considerations). 

Since our approach mixes divergence measures and classification measures, 
the analysis of Nguyen et al. [4] could shed some light on the properties of the 
scheme proposed. 

In forthcoming work, we intend to characterize the performance of IFs both
empirically and analytically.